<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
	<title>使用停用词 | Elasticsearch: 权威指南 | Elastic</title>
    <!-- Give IE8 a fighting chance -->
    <!--[if lt IE 9]>
    <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
    <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
	<link rel="stylesheet" type="text/css" href="../static/styles.css" />
</head>
<body>
<div class="main-container">
    <section id="content">
        
        <div class="content-wrapper">
            <section id="guide" lang="zh_cn">
                <div class="container">
                    <div class="row">
                        <div class="col-xs-12 col-sm-8 col-md-8 guide-section">
                            <div style="color:gray; word-break: break-all; font-size:12px;">原文地址: <a href="https://www.elastic.co/guide/cn/elasticsearch/guide/current/using-stopwords.html" rel="nofollow">https://www.elastic.co/guide/cn/elasticsearch/guide/current/using-stopwords.html</a>, 版权归 www.elastic.co 所有<br/>
                            英文版地址: <a href="https://www.elastic.co/guide/en/elasticsearch/guide/current/using-stopwords.html" rel="nofollow">https://www.elastic.co/guide/en/elasticsearch/guide/current/using-stopwords.html</a>
                            </div>
                        <!-- start body -->
                  <div class="page_header">
<b>请注意:</b><br>本书基于 Elasticsearch 2.x 版本，有些内容可能已经过时。
</div>
<div id="content">
<div class="breadcrumbs">
<span class="breadcrumb-link"><a href="index.html">Elasticsearch: 权威指南</a></span>
»
<span class="breadcrumb-link"><a href="languages.html">处理人类语言</a></span>
»
<span class="breadcrumb-link"><a href="stopwords.html">停用词: 性能与精度</a></span>
»
<span class="breadcrumb-node">使用停用词</span>
</div>
<div class="navheader">
<span class="prev">
<a href="pros-cons-stopwords.html">« 停用词的优缺点</a>
</span>
<span class="next">
<a href="stopwords-performance.html">停用词与性能 »</a>
</span>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h2 class="title">
<a id="using-stopwords"></a>使用停用词(<code class="term">stopwords</code>)<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/240_Stopwords/20_Using_stopwords.asciidoc">edit</a>
</h2>
</div></div></div>
<p>移除停用词的工作是由<a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-stop-tokenfilter.html" class="ulink" target="_top"><code class="literal">stop</code>词元过滤器(token filter)</a>完成的，可以通过创建自定义的分析器来使用它。但是，也有一些自带的分析器预置了<code class="literal">stop</code>过滤器：</p>
<div class="variablelist">
<dl class="variablelist">
<dt>
<span class="term">
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-lang-analyzer.html" class="ulink" target="_top">语言分析器(language analyzers)</a>
</span>
</dt>
<dd>
每个语言分析器默认使用与该语言相适的停用词列表，例如：<code class="literal">english</code> 英语分析器使用 <code class="literal">_english_</code> 停用词列表。
</dd>
<dt>
<span class="term">
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-standard-analyzer.html" class="ulink" target="_top">标准分析器(<code class="literal">standard</code> analyzer)</a>
</span>
</dt>
<dd>
默认使用空的停用词列表：<code class="literal">_none_</code> ，实际上是禁用了停用词。
</dd>
<dt>
<span class="term">
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-pattern-analyzer.html" class="ulink" target="_top">模式分析器(<code class="literal">pattern</code> analyzer)</a>
</span>
</dt>
<dd>
默认使用空的停用词列表：为 <code class="literal">_none_</code> ，与 <code class="literal">standard</code> 分析器类似。
</dd>
</dl>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="_停用词和标准分析器stopwords_and_the_standard_analyzer"></a>停用词(<code class="term">stopwords</code>)和标准分析器(<code class="literal">standard</code> <code class="term">analyzer</code>)<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/240_Stopwords/20_Using_stopwords.asciidoc">edit</a>
</h3>
</div></div></div>
<p>为了让<code class="literal">standard</code>分析器使用自定义的停用词，我们只需在创建一个分析器时使用已配置好的分析器，然后传入停用词列表：</p>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { <a id="CO167-1"></a><i class="conum" data-value="1"></i>
          "type": "standard", <a id="CO167-2"></a><i class="conum" data-value="2"></i>
          "stopwords": [ "and", "the" ] <a id="CO167-3"></a><i class="conum" data-value="3"></i>
        }
      }
    }
  }
}</pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO167-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>自定义的分析器名称为 <code class="literal">my_analyzer</code> 。</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO167-2"><i class="conum" data-value="2"></i></a></p>
</td>
<td align="left" valign="top">
<p>这个分析器是一个使用了一些自定义配置的标准(<code class="literal">standard</code>)分析器</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO167-3"><i class="conum" data-value="3"></i></a></p>
</td>
<td align="left" valign="top">
<p>过滤掉的停用词包括 <code class="literal">and</code> 和 <code class="literal">the</code> 。</p>
</td>
</tr>
</table>
</div>
<div class="tip admon">
<div class="icon"></div>
<div class="admon_content">
<p>任何语言分析器都可以使用相同的方式配置自定义停用词。</p>
</div>
</div>
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="maintaining-positions"></a>保持词元在原内容中的位置（maintaining positions）<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/240_Stopwords/20_Using_stopwords.asciidoc">edit</a>
</h3>
</div></div></div>
<p><code class="literal">analyzer</code> API的输出结果很有趣:</p>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead</pre>
</div>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">{
   "tokens": [
      {
         "token":        "quick",
         "start_offset": 4,
         "end_offset":   9,
         "type":         "&lt;ALPHANUM&gt;",
         "position":     1 <a id="CO168-1"></a><i class="conum" data-value="1"></i>
      },
      {
         "token":        "dead",
         "start_offset": 18,
         "end_offset":   22,
         "type":         "&lt;ALPHANUM&gt;",
         "position":     4
      }
   ]
}</pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO168-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p><code class="literal">position</code>标记了每个词元(<code class="term">token</code>)的位置。</p>
</td>
</tr>
</table>
</div>
<p>如我们期望的那样, 停用词被过滤掉了，但有趣的是这两个词项(<code class="term">term</code>)的位置(<code class="term">position</code>)没有变化：<code class="literal">quick</code> 是原句子的第二个词，<code class="literal">dead</code> 是第五个。这对短语查询(<code class="term">phrase query</code>)十分重要，因为如果每个词项的位置被调整了，一个短语查询<code class="literal">quick dead</code>会与上例中的文档错误的匹配。</p>
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="specifying-stopwords"></a>指定停用词（<code class="term">specifying stopwords</code>）<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/240_Stopwords/20_Using_stopwords.asciidoc">edit</a>
</h3>
</div></div></div>
<p>停用词可以以内联的方式传入，就像我们在前面的例子中那样，通过指定数组:</p>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">"stopwords": [ "and", "the" ]</pre>
</div>
<p>特定语言的默认停用词，可以通过使用 <code class="literal">_lang_</code> 符号来指定:</p>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">"stopwords": "_english_"</pre>
</div>

<div class="tip admon">
<div class="icon"></div>
<div class="admon_content">
<p>Elasticsearch 中预定义的与语言相关的停用词列表可以在文档 <a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-stop-tokenfilter.html" class="ulink" target="_top"><code class="literal">stop</code> 词元过滤器</a>​ 中找到。</p>
</div>
</div>

<p>停用词可以通过指定一个特殊列表 <code class="literal">_none_</code> 来禁用。例如，使用 <code class="literal">_english_</code> 分析器但不使用停用词，可以通过以下方式做到：</p>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":      "english", <a id="CO169-1"></a><i class="conum" data-value="1"></i>
          "stopwords": "_none_" <a id="CO169-2"></a><i class="conum" data-value="2"></i>
        }
      }
    }
  }
}</pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO169-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p><code class="literal">my_english</code> 分析器是基于 <code class="literal">english</code> 分析器。</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO169-2"><i class="conum" data-value="2"></i></a></p>
</td>
<td align="left" valign="top">
<p>但禁用了停用词。</p>
</td>
</tr>
</table>
</div>
<p>最后，停用词还可以使用一行一个单词的格式保存在文件中。此文件必须在集群的所有节点上，并且通过 <code class="literal">stopwords_path</code>  参数设置路径:</p>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stopwords_path": "stopwords/english.txt" <a id="CO170-1"></a><i class="conum" data-value="1"></i>
        }
      }
    }
  }
}</pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO170-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>停用词文件的路径，该路径相对于 Elasticsearch 的 <code class="literal">config</code> 目录。</p>
</td>
</tr>
</table>
</div>
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="stop-token-filter"></a>使用 <code class="literal">stop</code>词元过滤器（<code class="term">token filter</code>）<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/240_Stopwords/20_Using_stopwords.asciidoc">edit</a>
</h3>
</div></div></div>
<p>当你需要自定义一个分析器的时候，可以把<a href="analysis-stop-tokenfilter.html?v=2.4" alt="https://www.elastic.co/guide/en/elasticsearch/reference/2.4/analysis-stop-tokenfilter.html"><code class="literal">stop</code>词元过滤器</a>和一个分词器(<code class="term">tokenizer</code>)及其他若干个词元过滤器(<code class="term">token filter</code>)组合起来.
例如：我们想要创建一个西班牙语的分析器:</p>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
自定义停用词列表
</li>
<li class="listitem">
使用<code class="literal">light_spanish</code>词干提取器(<code class="term">stemmer</code>)
</li>
<li class="listitem">
使用<code class="literal">asciifolding</code>过滤器去掉附加符号(<code class="term">diacritics</code>)， 比如变音符号
</li>
</ul>
</div>
<p>我们可以通过以下设置完成:</p>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type":        "stop",
          "stopwords": [ "si", "esta", "el", "la" ]  <a id="CO171-1"></a><i class="conum" data-value="1"></i>
        },
        "light_spanish": { <a id="CO171-2"></a><i class="conum" data-value="2"></i>
          "type":     "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "spanish",
          "filter": [ <a id="CO171-3"></a><i class="conum" data-value="3"></i>
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}</pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO171-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p><code class="literal">stop</code>词元过滤器(<code class="term">token filter</code>)采用与 <code class="literal">standard</code>分析器相同的 <code class="literal">stopwords</code> 和 <code class="literal">stopwords_path</code> 参数。</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO171-2"><i class="conum" data-value="2"></i></a></p>
</td>
<td align="left" valign="top">
<p>参见 <a href="https://www.elastic.co/guide/en/elasticsearch/guide/current/algorithmic-stemmers.html">算法提取器(<code class="term">algorithmic stemmers</code>)</a>。</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO171-3"><i class="conum" data-value="3"></i></a></p>
</td>
<td align="left" valign="top">
<p>过滤器的顺序非常重要，下面会进行解释。</p>
</td>
</tr>
</table>
</div>
<p>我们将 <code class="literal">spanish_stop</code> 过滤器放置在 <code class="literal">asciifolding</code> 过滤器之后。这意味着 <code class="word">esta</code>、<code class="word">ésta</code>、<code class="word">++está++</code>这三个词，先通过 <code class="literal">asciifolding</code> 过滤器过滤掉特殊字符变成了 <code class="word">esta</code> ，随后使用停用词过滤器会将 <code class="word">esta</code> 去除。
如果我们只想移除 <code class="word">esta</code> 和 <code class="word">ésta</code> ，但是 <code class="word">++está++</code> 不想移除，必须将 <code class="literal">spanish_stop</code> 过滤器放置在 <code class="literal">asciifolding</code> 之前，并且需要在停用词中指定 <code class="word">esta</code> 和 <code class="word">ésta</code> 。</p>
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="updating-stopwords"></a>更新停用词(<code class="term">Updating Stopwords</code>)<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/240_Stopwords/20_Using_stopwords.asciidoc">edit</a>
</h3>
</div></div></div>
<p>想要更新分析器的停用词列表有多种方式， 分析器在创建索引时，当集群节点重启时候，或者关闭的索引重新打开的时候。</p>
<p>如果你使用 <code class="literal">stopwords</code> 参数以内联方式指定停用词，那么你只能通过关闭索引，更新分析器的配置​<a href="https://www.elastic.co/guide/en/elasticsearch/reference/2.4/indices-update-settings.html#update-settings-analysis" class="ulink" target="_top">update index settings API</a>​，然后在重新打开索引才能更新停用词。</p>
<p>如果你使用 <code class="literal">stopwords_path</code> 参数指定停用词的文件路径 ，那么更新停用词就简单了。你只需更新文件(在每一个集群节点上)，然后通过两者之中的任何一个操作来强制重新创建分析器:</p>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
关闭和重新打开索引
(参考 <a href="https://www.elastic.co/guide/en/elasticsearch/reference/2.4/indices-open-close.html" class="ulink" target="_top">索引的开与关</a>)，
</li>
<li class="listitem">
一一重启集群下的每个节点。
</li>
</ul>
</div>
<p>当然，更新的停用词不会改变任何已经存在的索引。这些停用词的只适用于新的搜索或更新文档。如果要改变现有的文档，则需要重新索引数据。参加 <a class="xref" href="reindex.html" title="重新索引你的数据">重新索引你的数据(<code class="term">reindex</code>)</a> 。</p>
</div>

</div>
<div class="navfooter">
<span class="prev">
<a href="pros-cons-stopwords.html">« 停用词的优缺点</a>
</span>
<span class="next">
<a href="stopwords-performance.html">停用词与性能 »</a>
</span>
</div>
</div>

                  <!-- end body -->
                        </div>
                        <div class="col-xs-12 col-sm-4 col-md-4" id="right_col">
                        
                        </div>
                    </div>
                </div>
            </section>
        </div>
    </section>
</div>
<script src="../static/cn.js"></script>
</body>
</html>